Goto

Collaborating Authors

 regional language


What Do Indonesians Really Need from Language Technology? A Nationwide Survey

Kautsar, Muhammad Dehan Al, Susanto, Lucky, Wijaya, Derry, Koto, Fajri

arXiv.org Artificial Intelligence

There is an emerging effort to develop NLP for Indonesias 700+ local languages, but progress remains costly due to the need for direct engagement with native speakers. However, it is unclear what these language communities truly need from language technology. To address this, we conduct a nationwide survey to assess the actual needs of native speakers in Indonesia. Our findings indicate that addressing language barriers, particularly through machine translation and information retrieval, is the most critical priority. Although there is strong enthusiasm for advancements in language technology, concerns around privacy, bias, and the use of public data for AI training highlight the need for greater transparency and clear communication to support broader AI adoption.


Through the Prism of Culture: Evaluating LLMs' Understanding of Indian Subcultures and Traditions

Chhikara, Garima, Kumar, Abhishek, Chakraborty, Abhijnan

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have shown remarkable advancements but also raise concerns about cultural bias, often reflecting dominant narratives at the expense of under-represented subcultures. In this study, we evaluate the capacity of LLMs to recognize and accurately respond to the Little Traditions within Indian society, encompassing localized cultural practices and subcultures such as caste, kinship, marriage, and religion. Through a series of case studies, we assess whether LLMs can balance the interplay between dominant Great Traditions and localized Little Traditions. We explore various prompting strategies and further investigate whether using prompts in regional languages enhances the models cultural sensitivity and response quality. Our findings reveal that while LLMs demonstrate an ability to articulate cultural nuances, they often struggle to apply this understanding in practical, context-specific scenarios. To the best of our knowledge, this is the first study to analyze LLMs engagement with Indian subcultures, offering critical insights into the challenges of embedding cultural diversity in AI systems.


Survey of Pseudonymization, Abstractive Summarization & Spell Checker for Hindi and Marathi

Ransing, Rasika, Dhamaskar, Mohammed Amaan, Rajpurohit, Ayush, Dhoke, Amey, Dalvi, Sanket

arXiv.org Artificial Intelligence

India's vast linguistic diversity presents unique challenges and opportunities for technological advancement, especially in the realm of Natural Language Processing (NLP). While there has been significant progress in NLP applications for widely spoken languages, the regional languages of India, such as Marathi and Hindi, remain underserved. Research in the field of NLP for Indian regional languages is at a formative stage and holds immense significance. The paper aims to build a platform which enables the user to use various features like text anonymization, abstractive text summarization and spell checking in English, Hindi and Marathi language. The aim of these tools is to serve enterprise and consumer clients who predominantly use Indian Regional Languages.


Talking Like One of Us: Effects of Using Regional Language in a Humanoid Social Robot

Sievers, Thomas, Russwinkel, Nele

arXiv.org Artificial Intelligence

Social robots are becoming more and more perceptible in public service settings. For engaging people in a natural environment a smooth social interaction as well as acceptance by the users are important issues for future successful Human-Robot Interaction (HRI). The type of verbal communication has a special significance here. In this paper we investigate the effects of spoken language varieties of a non-standard / regional language compared to standard language. More precisely we compare a human dialog with a humanoid social robot Pepper where the robot on the one hand is answering in High German and on the other hand in Low German, a regional language that is understood and partly still spoken in the northern parts of Germany. The content of what the robot says remains the same in both variants. We are interested in the effects that these two different ways of robot talk have on human interlocutors who are more or less familiar with Low German in terms of perceived warmth, competence and possible discomfort in conversation against a background of cultural identity. To measure these factors we use the Robotic Social Attributes Scale (RoSAS) on 17 participants with an age ranging from 19 to 61. Our results show that significantly higher warmth is perceived in the Low German version of the conversation.


Evaluating Cultural Awareness of LLMs for Yoruba, Malayalam, and English

Dawson, Fiifi, Mosunmola, Zainab, Pocker, Sahil, Dandekar, Raj Abhijit, Dandekar, Rajat, Panat, Sreedath

arXiv.org Artificial Intelligence

Although LLMs have been extremely effective in a large number of complex tasks, their understanding and functionality for regional languages and cultures are not well studied. In this paper, we explore the ability of various LLMs to comprehend the cultural aspects of two regional languages: Malayalam (state of Kerala, India) and Yoruba (West Africa). Using Hofstede's six cultural dimensions: Power Distance (PDI), Individualism (IDV), Motivation towards Achievement and Success (MAS), Uncertainty Avoidance (UAV), Long Term Orientation (LTO), and Indulgence (IVR), we quantify the cultural awareness of LLM-based responses. We demonstrate that although LLMs show a high cultural similarity for English, they fail to capture the cultural nuances across these 6 metrics for Malayalam and Yoruba. We also highlight the need for large-scale regional language LLM training with culturally enriched datasets. This will have huge implications for enhancing the user experience of chat-based LLMs and also improving the validity of large-scale LLM agent-based market research.


Speech Analysis of Language Varieties in Italy

La Quatra, Moreno, Koudounas, Alkis, Baralis, Elena, Siniscalchi, Sabato Marco

arXiv.org Artificial Intelligence

Italy exhibits rich linguistic diversity across its territory due to the distinct regional languages spoken in different areas. Recent advances in self-supervised learning provide new opportunities to analyze Italy's linguistic varieties using speech data alone. This includes the potential to leverage representations learned from large amounts of data to better examine nuances between closely related linguistic varieties. In this study, we focus on automatically identifying the geographic region of origin of speech samples drawn from Italy's diverse language varieties. We leverage self-supervised learning models to tackle this task and analyze differences and similarities between Italy's regional languages. In doing so, we also seek to uncover new insights into the relationships among these diverse yet closely related varieties, which may help linguists understand their interconnected evolution and regional development over time and space. To improve the discriminative ability of learned representations, we evaluate several supervised contrastive learning objectives, both as pre-training steps and additional fine-tuning objectives. Experimental evidence shows that pre-trained self-supervised models can effectively identify regions from speech recording. Additionally, incorporating contrastive objectives during fine-tuning improves classification accuracy and yields embeddings that distinctly separate regional varieties, demonstrating the value of combining self-supervised pre-training and contrastive learning for this task.


SUKHSANDESH: An Avatar Therapeutic Question Answering Platform for Sexual Education in Rural India

Singh, Salam Michael, Garg, Shubhmoy Kumar, Misra, Amitesh, Seth, Aaditeshwar, Chakraborty, Tanmoy

arXiv.org Artificial Intelligence

Sexual education aims to foster a healthy lifestyle in terms of emotional, mental and social well-being. In countries like India, where adolescents form the largest demographic group, they face significant vulnerabilities concerning sexual health. Unfortunately, sexual education is often stigmatized, creating barriers to providing essential counseling and information to this at-risk population. Consequently, issues such as early pregnancy, unsafe abortions, sexually transmitted infections, and sexual violence become prevalent. Our current proposal aims to provide a safe and trustworthy platform for sexual education to the vulnerable rural Indian population, thereby fostering the healthy and overall growth of the nation. In this regard, we strive towards designing SUKHSANDESH, a multi-staged AI-based Question Answering platform for sexual education tailored to rural India, adhering to safety guardrails and regional language support. By utilizing information retrieval techniques and large language models, SUKHSANDESH will deliver effective responses to user queries. We also propose to anonymise the dataset to mitigate safety measures and set AI guardrails against any harmful or unwanted response generation. Moreover, an innovative feature of our proposal involves integrating ``avatar therapy'' with SUKHSANDESH. This feature will convert AI-generated responses into real-time audio delivered by an animated avatar speaking regional Indian languages. This approach aims to foster empathy and connection, which is particularly beneficial for individuals with limited literacy skills. Partnering with Gram Vaani, an industry leader, we will deploy SUKHSANDESH to address sexual education needs in rural India.


Komodo: A Linguistic Expedition into Indonesia's Regional Languages

Owen, Louis, Tripathi, Vishesh, Kumar, Abhay, Ahmed, Biddwan

arXiv.org Artificial Intelligence

The recent breakthroughs in Large Language Models (LLMs) have mostly focused on languages with easily available and sufficient resources, such as English. However, there remains a significant gap for languages that lack sufficient linguistic resources in the public domain. Our work introduces Komodo-7B, 7-billion-parameter Large Language Models designed to address this gap by seamlessly operating across Indonesian, English, and 11 regional languages in Indonesia. Komodo-7B is a family of LLMs that consist of Komodo-7B-Base and Komodo-7B-Instruct. Komodo-7B-Instruct stands out by achieving state-of-the-art performance in various tasks and languages, outperforming the benchmarks set by OpenAI's GPT-3.5, Cohere's Aya-101, Llama-2-Chat-13B, Mixtral-8x7B-Instruct-v0.1, Gemma-7B-it , and many more. This model not only demonstrates superior performance in both language-specific and overall assessments but also highlights its capability to excel in linguistic diversity. Our commitment to advancing language models extends beyond well-resourced languages, aiming to bridge the gap for those with limited linguistic assets. Additionally, Komodo-7B-Instruct's better cross-language understanding contributes to addressing educational disparities in Indonesia, offering direct translations from English to 11 regional languages, a significant improvement compared to existing language translation services. Komodo-7B represents a crucial step towards inclusivity and effectiveness in language models, providing to the linguistic needs of diverse communities.


NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural

Wongso, Wilson, Setiawan, David Samuel, Limcorn, Steven, Joyoadikusumo, Ananto

arXiv.org Artificial Intelligence

Indonesia's linguistic landscape is remarkably diverse, encompassing over 700 languages and dialects, making it one of the world's most linguistically rich nations. This diversity, coupled with the widespread practice of code-switching and the presence of low-resource regional languages, presents unique challenges for modern pre-trained language models. In response to these challenges, we developed NusaBERT, building upon IndoBERT by incorporating vocabulary expansion and leveraging a diverse multilingual corpus that includes regional languages and dialects. Through rigorous evaluation across a range of benchmarks, NusaBERT demonstrates state-of-the-art performance in tasks involving multiple languages of Indonesia, paving the way for future natural language understanding research for under-represented languages.


Doctors Handwritten Prescription Recognition System In Multi Language Using Deep Learning

G, Pavithiran, Padmanabhan, Sharan, Divya, Nuvvuru, V, Aswathy, P, Irene Jerusha, B, Chandar

arXiv.org Artificial Intelligence

Doctors typically write in incomprehensible handwriting, making it difficult for both the general public and some pharmacists to understand the medications they have prescribed. It is not ideal for them to write the prescription quietly and methodically because they will be dealing with dozens of patients every day and will be swamped with work.As a result, their handwriting is illegible. This may result in reports or prescriptions consisting of short forms and cursive writing that a typical person or pharmacist won't be able to read properly, which will cause prescribed medications to be misspelled. However, some individuals are accustomed to writing prescriptions in regional languages because we all live in an area with a diversity of regional languages. It makes analyzing the content much more challenging. So, in this project, we'll use a recognition system to build a tool that can translate the handwriting of physicians in any language. This system will be made into an application which is fully autonomous in functioning. As the user uploads the prescription image the program will pre-process the image by performing image pre-processing, and word segmentations initially before processing the image for training. And it will be done for every language we require the model to detect. And as of the deduction model will be made using deep learning techniques including CNN, RNN, and LSTM, which are utilized to train the model. To match words from various languages that will be written in the system, Unicode will be used. Furthermore, fuzzy search and market basket analysis are employed to offer an end result that will be optimized from the pharmaceutical database and displayed to the user as a structured output.